# **EXHIBIT E**

# U.S. Patent No. 7,796,133 LG / MediaTek Products

"1. A unified shader comprising:"

1. A unified shader comprising:

The LG 49UH6500 television and X Power LS755 phone (collectively, the "LG Products") include a unified shader.



See http://www.lg.com/us/support-product/lg-49UH6500.

LG X power<sup>TM</sup> Boost Mobile<sup>®</sup>
LS755



See http://www.lg.com/us/cell-phones/lg-LS755-x-power-boost-mobile.

The LG Products include one of the following System-on-Chips (SoCs): M16 and MediaTek MT6755M.



*See* LG LED TV Service Manual, Chassis: UA63J, Model: 43UH6500, p.28, *available at* https://lg.encompass.com/shop/model\_research\_docs/?file=/ZEN/sm/43UH6500UB.pdf.

The LG 49UH6500 television and the LG 43UH6500 television are part of the LG UH6500 Series televisions. *See* http://www.lg.com/us/support/products/documents/UH6500\_Series\_Spec\_Sheet\_Updated\_10112016.pdf.

Case 1:17-cv-00065-SLR Document 1-5 Filed 01/23/17 Page 5 of 67 PageID #: 109

# Technical Specifications

Carrier Boost Mobile®

Display 5.3" (1280 x 720) HD TFT Display

Battery 4,100 mAh non-removable

Platform Android 6.0.1 Marshmallow

Processor MediaTek 1.8 GHz Octa-Core MT6755M

See http://www.lg.com/us/cell-phones/lg-LS755-x-power-boost-mobile.

The SoCs include one of the following ARM Mali graphics processing units (the "Mali GPUs"): T760 MP2 and T860 MP2.

|                |           | M16                     |
|----------------|-----------|-------------------------|
|                | CPU       | CA53 x4 1.1GHz / 1MB    |
|                | GPU       | Mali T760 MP2 (650MHz)  |
| Smart Function | OSD       | Separated 2K@60p        |
|                | HEVC      | 4K @60, 10bit           |
|                | DDR       | DDR3-2133 / DDR4-2400   |
|                | Audio DSP | HiFi3 Dual @370MHz      |

*See* LG LED TV Service Manual, Chassis: UA63J, Model: 43UH6500, p.123, *available at* https://lg.encompass.com/shop/model\_research\_docs/?file=/ZEN/sm/43UH6500UB.pdf.


| MediaTek MT6755 Helio P10 Specs |                              |
|---------------------------------|------------------------------|
| Release                         | Q4 2015                      |
| Process                         | 28nm                         |
| Apps CPU                        | 8x Cortex-A53, up to 2.0GHz  |
| GPU                             | ARM Mali-T860 MP2 at 700 MHz |

See http://cnoemphone.com/blog/mediatek-mt6755-helio-p10-specs-benchmark-and-smartphone-list.

The Mali GPUs share substantially similar structure, function, and operation.



"1. A unified shader comprising:"



See http://www.arm.com/products/multimedia/mali-gpu/high-performance/mali-t860-t880.php.

#### **GPU** Architecture

The "Midgard" family of Mali GPUs (the Mali-T600 and Mali-T700 series) use a unified shader core architecture, meaning that only a single type of shader core exists in the design. This single core can execute all types of programmable shader code, including vertex shaders, fragment shaders, and compute kernels.

*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.


"an input interface for receiving a packet from a rasterizer;"

an input interface for receiving a packet from a rasterizer;

The LG Products include an input interface for receiving a packet from a rasterizer.

For example, "[t]he Mali shader core is structured as a number of fixed-function hardware blocks wrapped around a programmable tri-pipe execution core. The fixed function units perform the setup for a shader operation - such as rasterizing triangles or performing depth testing - or handling the post-shader activities - such as blending, or writing back a whole tile's worth of data at the end of rendering. The tripipe itself is the programmable part responsible for the execution of shader programs." *See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.

#### Shader Core Architecture




*See* Ryan Smith, ARM's Mali Midgard Architecture Explored, http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/4.


6 Advanced Graphics Techniques 6.1 Custom shaders

#### Primitive assembly

In primitive assembly the vertices are assembled into geometric primitives. The resulting primitives are clipped to a clipping volume and sent to the rasterizer.

#### Rasterization

Output values from the vertex shader are calculated for every generated fragment. This process is known as interpolation. During rasterization, the primitives are converted into a set of two-dimensional fragments that are then sent to the fragment shader.

#### Transform feedback

Transform feedback, enables writing selective writing to an output buffer that the vertex shader outputs and is later sent back to the vertex shader. This feature is not exposed by Unity but it is used internally, for example, to optimize the skinning of characters.

### Fragment shader

The fragment shader implements a general-purpose programmable method for operating on fragments before they are sent to the next stage.

#### Per-fragment operations

In Per-fragment operations several functions and tests are applied on each fragment: pixel ownership test, scissor test, stencil and depth tests, blending and dithering. As a result of this per-fragment stage either the fragment is discarded or the fragment color, depth or stencil value is written to the frame buffer in screen coordinates.

See "ARM Guide to Unity" Version 2.1 page 6-77 available at http://infocenter.arm.com/help/topic/com.arm.doc.100140\_0201\_00\_en/arm\_guide\_to\_unity\_enhancing\_your\_mobile\_games\_100140\_0201\_00\_en.pdf (accessed 10/27/2016)
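For illustration only, the interpolation step described in the excerpt above can be sketched as follows. This is a minimal Python sketch under stated assumptions: it uses plain barycentric (affine) weights and omits perspective correction, and the function names `barycentric_weights` and `interpolate` are hypothetical. It is not a description of how the Mali hardware implements rasterization.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def barycentric_weights(p: Point, a: Point, b: Point, c: Point) -> Tuple[float, float, float]:
    """Return the barycentric weights of point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    denom = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if denom == 0:
        raise ValueError("degenerate triangle")
    w_a = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / denom
    w_b = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / denom
    w_c = 1.0 - w_a - w_b
    return w_a, w_b, w_c


def interpolate(weights: Sequence[float], values: Sequence[Sequence[float]]) -> Tuple[float, ...]:
    """Blend one per-vertex output (e.g. a color) into a per-fragment value."""
    width = len(values[0])
    return tuple(sum(w * v[i] for w, v in zip(weights, values)) for i in range(width))


# Hypothetical triangle in screen space with one color output per vertex.
triangle = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
vertex_colors = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# Value seen by the fragment at screen position (1, 1).
fragment_weights = barycentric_weights((1.0, 1.0), *triangle)
fragment_color = interpolate(fragment_weights, vertex_colors)
```

In this sketch each fragment receives a weighted blend of the three vertex outputs, which is the sense in which the excerpt says values are "calculated for every generated fragment."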


"a shading processing mechanism configured to produce a resultant value from said packet by performing one or more shading operations,"

a shading processing mechanism configured to produce a resultant value from said packet by performing one or more shading operations,

The LG Products include a shading processing mechanism configured to produce a resultant value from a rasterizer packet by performing one or more shading operations.

For example, the Mali GPU has three classes of pipelines "in the tripipe design: one handling arithmetic operations, one handling memory load/store and varying access, and one handling texture access. There is one load/store and one texture pipe per shader core, but the number of arithmetic pipelines can vary." *See* http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/4.



*See* http://malideveloper.arm.com/downloads/ARM\_Game\_Developer\_Days/PDFs/2-Mali-GPU-architecture-overview-and-tile-local-storage.pdf.

*See* http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/4.

See https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.


"wherein said shading operations comprise both texture operations and color operations and comprising at least one ALU/memory pair operative to perform both texture operations and color operations"

wherein said shading operations comprise both texture operations and color operations and comprising at least one ALU/memory pair operative to perform both texture operations and color operations

The LG Products include shading operations comprising both texture operations and color operations and comprising at least one ALU/memory pair operative to perform both texture operations and color operations.

In particular, the shading operations comprise texture operations. For example, the SC includes a texture pipeline, load/store pipeline and a plurality of arithmetic pipelines. The "texture pipeline (T-pipe) is responsible for all memory access to do with textures. The texture pipeline can return one bilinear filtered texel per clock; trilinear filtering requires us to load samples from two different mipmaps in memory." *See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.



See http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/4.


#### 7.2.2 About Mali GPU architectures

Mali GPUs use a SIMD architecture. Instructions operate on multiple data elements simultaneously.

The peak throughput depends on the hardware implementation of the Mali GPU type and configuration.

The Mali GPUs contain 1 to 16 identical shader cores. Each shader core supports up to 384 concurrently executing threads.

Each shader core contains:

- One to four arithmetic pipelines.
- One load-store pipeline.
- One texture pipeline.

See "ARM Guide to Unity" Version 2.1 page 7-64 available at http://infocenter.arm.com/help/topic/com.arm.doc.100140\_0201\_00\_en/arm\_guide\_to\_unity\_enhancing\_your\_mobile\_games\_100140\_0201\_00\_en.pdf (accessed 10/27/2016).
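For illustration only, the figures in the excerpt above (1 to 16 shader cores, up to 384 concurrently executing threads per core) can be combined into a simple upper bound. This is a minimal Python sketch; the function name `max_concurrent_threads` is hypothetical, and reading "MP2" as a two-core configuration is an assumption that the cited excerpt does not itself state.

```python
# Figures taken from the quoted ARM excerpt.
MAX_THREADS_PER_CORE = 384
MIN_CORES, MAX_CORES = 1, 16


def max_concurrent_threads(shader_cores: int) -> int:
    """Upper bound on concurrently executing threads for a given core count."""
    if not MIN_CORES <= shader_cores <= MAX_CORES:
        raise ValueError("shader_cores must be between 1 and 16")
    return shader_cores * MAX_THREADS_PER_CORE


# Assumption: an "MP2" part has two shader cores.
mp2_threads = max_concurrent_threads(2)
```

Under that assumption, a two-core part would support at most 2 x 384 = 768 concurrently executing threads.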

The arithmetic pipeline ("ALU") performs texture operations. For example, the "ALU pipeline can read/write to 32 128-bit registers" including "texture pipeline results" from the texture pipe.

#### Registers

The ALU pipeline can read/write to 32 128-bit registers, which can be divided into 4 32-bit (highp in GLSL) components (one vec4) or 8 16-bit (mediump) components (two vec4's). Some of the registers, however, are dedicated to special purposes (see below) and are read-only or write-only.

#### **Special Registers**

```
r24 - can mean "unused" for 1-src instructions, or a pipeline register
r26 - inline constant
r27 - load/store offset when used as output register
r28-r29 - texture pipeline results
r31.w - conditional select input when written to in scalar add ALU
```

r0 - r23 is divided into two spaces: work registers and uniform registers. A configurable number of registers can be devoted to each; if there are N uniform registers, then r0 - r(23-N) are work registers and r(24-N)-r23 are uniform registers.

See http://limadriver.org/T6xx+ISA/.
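For illustration only, the register layout described in the excerpt above can be expressed as a short calculation. This is a minimal Python sketch; the function names `partition_registers` and `components_per_register` are hypothetical, and the accepted range for the number of uniform registers (0 to 24) is an assumption, since the excerpt says only that the number is configurable.

```python
REGISTER_BITS = 128          # each register is 128 bits wide, per the excerpt
GENERAL_REGISTER_COUNT = 24  # r0 - r23 are split between work and uniform use


def partition_registers(num_uniform: int) -> dict:
    """Split r0-r23 into work and uniform registers for N uniform registers.

    Per the excerpt: r0 - r(23-N) are work registers and
    r(24-N) - r23 are uniform registers.
    """
    if not 0 <= num_uniform <= GENERAL_REGISTER_COUNT:  # assumed bounds
        raise ValueError("num_uniform must be between 0 and 24")
    split = GENERAL_REGISTER_COUNT - num_uniform
    return {
        "work": [f"r{i}" for i in range(0, split)],
        "uniform": [f"r{i}" for i in range(split, GENERAL_REGISTER_COUNT)],
    }


def components_per_register(component_bits: int) -> int:
    """Number of components that fit in one register (32-bit highp or 16-bit mediump)."""
    if component_bits not in (16, 32):
        raise ValueError("component_bits must be 16 or 32")
    return REGISTER_BITS // component_bits


layout = partition_registers(8)
```

With N = 8 in this sketch, r0 through r15 are work registers and r16 through r23 are uniform registers. The special registers listed in the excerpt (r24, r26, r27, r28-r29, r31.w) fall outside this range and are not modeled here.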

The ALU also performs color operations. For example, the "Mali [GPU] only has to write the color data for a single tile back to memory at the end of the tile." *See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/02/20/the-mali-gpu-an-abstract-machine-part-2.



*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/02/20/the-mali-gpu-an-abstract-machine-part-2.



*See* https://community.arm.com/groups/arm-mali-graphics/blog/2012/08/17/how-low-can-you-go-building-low-power-low-bandwidth-arm-mali-gpus.

Moreover, the ALU is responsible for performing the "[m]ath in the shaders[.]"


# ARM® Mali™-T628 GPU Tripipe

## Tripipe Cycles

- Arithmetic instructions
  - Math in the shaders
- Load & Store instructions
  - Uniforms, attributes and varyings
- Texture instructions
  - Texture sampling and filtering
- Instructions can run in parallel
- Each one can be a bottleneck
- There are two arithmetic pipelines so we should aim to increase the arithmetic workload




See http://malideveloper.arm.com/downloads/GDC14/Weds/11.15amStreamlineMaliHWCounters.pdf.

Additionally, "there are three classes of execution pipeline in the tripipe design: one handling arithmetic operations, one handling memory load/store and varying access, and one handling texture access. There is one load/store and one texture pipe per shader core, but the number of arithmetic pipelines can vary depending on which GPU you are using; most silicon shipping today will have two arithmetic pipelines." *See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.

*See* http://malideveloper.arm.com/downloads/ARM\_Game\_Developer\_Days/PDFs/2-Mali-GPU-architecture-overview-and-tile-local-storage.pdf.


#### 1.1 About Mali GPUs

ARM produces the following families of Mali GPUs:

Mali Midgard GPUs

- Mali-T600 series.
- Mali-T720.
- Mali-T760.
- Mali-T820.
- Mali-T830.
- Mali-T860.
- Mali-T880.

Mali Utgard GPUs

The Mali Utgard GPUs include the following:

- Mali-300.
- Mali-400 MP.
- Mali-450 MP.
- Mali-470 MP.

Note: The Mali Utgard GPUs do not support OpenCL.

Mali GPUs can have one or more shader cores. Each shader core contains one or more *Arithmetic Logic Units* (ALUs).

The ALUs are based on a Single Instruction Multiple Data (SIMD) architecture. Instructions operate on multiple data elements simultaneously.

Mali GPUs run data processing tasks in parallel that contain relatively little control code. Mali GPUs typically contain many more processing units than application processors. This enables Mali GPUs to compute at a higher rate than application processors, without using more power.

See "ARM Guide to Unity" Version 2.1 page 1.1 available at http://infocenter.arm.com/help/topic/com.arm.doc.100140\_0201\_00\_en/arm\_guide\_to\_unity\_enhancing\_your\_mobile\_games\_100140\_0201\_00\_en.pdf (accessed 10/27/2016).




See ARM, How to Optimize Your Mobile Game with ARM Tools and Practical Examples, p.33, http://malideveloper.arm.com/downloads/GDC15/How%20to%20Optimize%20Your%20Mobile%20Game%20with%20ARM%20Tools%20and%20Practical%20Examples.pdf.

Furthermore, the unified shader comprises at least one ALU/memory pair. For example, as depicted below, each shader core is paired up with shared memory.





"wherein texture operations comprise at least one of: issuing a texture request to a texture unit and writing received texture values to the memory and."

wherein texture operations comprise at least one of: issuing a texture request to a texture unit and writing received texture values to the memory and.

The texture operations comprise at least one of: issuing a texture request to a texture unit and writing received texture values to the memory.

For example, as depicted below, the "thread pool" issues texture packets to the "texture pipeline." Moreover, the "texture pipeline (T-pipe) is responsible for all memory access to do with textures. The texture pipeline can return one bilinear filtered texel per clock; trilinear filtering requires us to load samples from two different mipmaps in memory, so requires a second clock cycle to complete." *See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.



*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.


#### **Texture Pipeline**

The texture pipeline (T-pipe) is responsible for all memory access to do with textures. The texture pipeline can return one bilinear filtered texel per clock; trilinear filtering requires us to load samples from two different mipmaps in memory, so requires a second clock cycle to complete.

*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core
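For illustration only, the throughput figures quoted above (one bilinear filtered texel per clock, and a second clock cycle for trilinear filtering) can be turned into a rough cycle estimate. This is a minimal Python sketch; the function name `texture_cycles` is hypothetical, and the estimate is an idealized figure that assumes no cache misses or stalls.

```python
# Clock cycles per filtered texel, per the quoted excerpt.
CYCLES_PER_TEXEL = {
    "bilinear": 1,   # one bilinear filtered texel per clock
    "trilinear": 2,  # samples from two mipmaps, so a second clock cycle
}


def texture_cycles(texel_count: int, filtering: str) -> int:
    """Idealized texture-pipeline cycles needed to return texel_count texels."""
    if texel_count < 0:
        raise ValueError("texel_count must be non-negative")
    if filtering not in CYCLES_PER_TEXEL:
        raise ValueError(f"unsupported filtering mode: {filtering}")
    return texel_count * CYCLES_PER_TEXEL[filtering]


# Hypothetical workload: one texture sample for each pixel of a 16x16 tile.
samples = 16 * 16
bilinear_cost = texture_cycles(samples, "bilinear")
trilinear_cost = texture_cycles(samples, "trilinear")
```

In this sketch the trilinear workload takes twice as many texture-pipeline cycles as the bilinear one for the same number of samples.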

#### The Texture pipeline

Texture accesses use cycles in the Texture pipeline and use memory bandwidth. Using large textures can be detrimental because cache misses are more likely and this can cause multiple threads to stall while waiting for data.

To improve the performance of the Texture pipeline try the following:

#### Use mipmaps

Mipmaps increase the cache hit rate because it selects the best resolution of the texture to use based on the variation of texture coordinates.

#### Use texture compression

This is also good for reducing the memory bandwidth and increasing the cache hit rate. Each compressed block contains more than one texel, so accessing it makes it more cacheable.

#### Avoid trilinear or anisotropic filtering

Trilinear and anisotropic filtering increase the number of operations required to fetch texels. Avoid using these techniques unless you absolutely require them.

See "ARM Guide to Unity" Version 2.1 page 1-6 available at http://infocenter.arm.com/help/topic/com.arm.doc.100140\_0201\_00\_en/arm\_guide\_to\_unity\_enhancing\_your\_mobile\_games\_100140\_0201\_00\_en.pdf (accessed 10/27/2016)


# Images and Compute

Another way to update textures

- Compute shaders mandate image load/store operations
  - These have been optional in other shader stages
- Allow random read/write access to a texture bound as an image sampler
  - Use image\*D as shader sampler type
- Layer parameters control whether a single image, or an entire level is made accessible
  - Think texture array or 3D textures

```
// Setup: allocate texture storage and bind it as an image
glGenTextures( ... );
glBindTexture( GL_TEXTURE_2D, texId );
glTexStorage2D( GL_TEXTURE_2D, levels,
    format, width, height );
glBindImageTexture( unit, texId, Layered,
    Layer, GL_READ_WRITE, GL_RGBA32F );

// Update: run the compute shader that writes the image
glUseProgram( compute );
glDispatchCompute( ... );
glMemoryBarrier( GL_SHADER_STORAGE_BARRIER_BIT );

// Use: draw with the updated texture
glUseProgram( render );
glDrawArrays( ... );
```


*See* http://www.gdcvault.com/play/1020140/Getting-the-Most-Out-of.

[Figure: diagram with blocks labeled CPU, Memory, Vertex Shader, and Fragment Shader, and data labeled Vertices, Textures, Uniforms, Triangles, and Varyings.]

See ARM, ARM Tools Part 2, Best Optimization Practices for Mobile Platforms, p.13, available at http://malideveloper.arm.com/downloads/ARM\_Game\_Developer\_Days/PDFs/6%20-%20ARM%20Tools%20Part%202-%20Best%20Optimization%20Practices%20for%20Mobile%20Platforms.pdf.

Furthermore, the received texture values are written into memory.


#### Registers

The ALU pipeline can read/write to 32 128-bit registers, which can be divided into 4 32-bit (highp in GLSL) components (one vec4) or 8 16-bit (mediump) components (two vec4's). Some of the registers, however, are dedicated to special purposes (see below) and are read-only or write-only.

#### **Special Registers**

```
r24 - can mean "unused" for 1-src instructions, or a pipeline register
r26 - inline constant
r27 - load/store offset when used as output register
r28-r29 - texture pipeline results
r31.w - conditional select input when written to in scalar add ALU
```

r0 - r23 is divided into two spaces: work registers and uniform registers. A configurable number of registers can be devoted to each; if there are N uniform registers, then r0 - r(23-N) are work registers and r(24-N)-r23 are uniform registers.

See http://limadriver.org/T6xx+ISA/.


"wherein the at least one ALU is operative to read from and write to the memory to perform both texture and color operations; and"

wherein the at least one ALU is operative to read from and write to the memory to perform both texture and color operations; and

The LG Products include at least one ALU that is operative to read from and write to the memory to perform both texture and color operations.

For example, the ALU is designed to "strike a closer balance between shading and texturing."




As we've stated before, for our purposes we're primarily looking at the Mali-T760. On the T760 ARM uses 2 ALU blocks per tri pipe, which is the most common configuration that you will see for Midgard. However ARM also has Midgard designs that have 1 ALU block or 4 ALU blocks per tri pipe, which is one of the reasons why seemingly similarly GPUs such as T760, T720, and T678 can look so similar and yet behave so differently.

| ARM Mali Midgard GPU | Arithmetic Pipeline Count (Per Core) |
|----------------------|--------------------------------------|
| T628                 | 2                                    |
| T678                 | 4                                    |
| T720                 | 1                                    |
| T760                 | 2                                    |

Without being fully exhaustive, among various Midgard designs T628 and T760 are 2 ALU designs, while T720 is a 1 ALU design, and T678 is a 4 ALU design.

As one would expect, the different number of arithmetic pipelines per tri pipe has a knock-on effect on performance in all aspects, due to the changing ratio between the number of arithmetic pipelines and the number of load/store units and texture units. T678, for example, would be fairly shader-heavy, whereas the 2 ALU designs strike a closer balance between shading and texturing. Among the various Midgard designs ARM has experimented with several configurations, and with the T700 series they have settled on 2 ALU designs for the high-end T760 and 1 ALU for the mid-range T720 (although ARM likes to point out that T720 has some further optimizations just for this 1 ALU configuration).

*See* http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/4.
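For illustration only, the "changing ratio" described in the excerpt above can be computed from the quoted pipeline counts. This is a minimal Python sketch; the function name `arithmetic_to_texture_ratio` is hypothetical, and it assumes one texture pipeline per shader core for every listed GPU, as stated elsewhere in the cited materials for the tripipe design.

```python
# Arithmetic pipelines per core, from the quoted table.
ARITHMETIC_PIPELINES = {"T628": 2, "T678": 4, "T720": 1, "T760": 2}

# Assumption: one texture pipeline per shader core for each listed GPU.
TEXTURE_PIPELINES_PER_CORE = 1


def arithmetic_to_texture_ratio(gpu: str) -> float:
    """Ratio of arithmetic pipelines to texture pipelines within one shader core."""
    if gpu not in ARITHMETIC_PIPELINES:
        raise KeyError(f"no pipeline count quoted for {gpu}")
    return ARITHMETIC_PIPELINES[gpu] / TEXTURE_PIPELINES_PER_CORE


ratios = {gpu: arithmetic_to_texture_ratio(gpu) for gpu in ARITHMETIC_PIPELINES}
most_shader_heavy = max(ratios, key=ratios.get)
```

In this sketch the T678 has the highest ratio (4 to 1), consistent with the excerpt's description of it as "fairly shader-heavy," while the T760 sits at 2 to 1.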

Further, the ALUs are operative to read from and write to memory to perform texture and color operations as shown below.

[Figure: shader core block diagram with blocks labeled Tilelist Reader, Rasterizer, Early ZS Testing, Fragment Thread Creator, Vertex Thread Creator, Thread Pool, Load/Store Pipeline, L1 Cache, Texture Pipeline, L1 Cache, Thread Retire, Tripipe, Late ZS Testing, Blending, Tile Memory, and Tile Writeback.]

See https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.




*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.

# Mali Architecture



- Hardware tiling
- Forward Pixel Kill
  - Reduce overdraw
- Framebuffer memory on-chip
  - 4x MSAA for "free"
  - Advanced on-chip shading
- Bandwidth efficiencies
  - ARM Framebuffer Compression
  - Transaction elimination
  - ASTC

See "Arm Mali GPU Architecture," a presentation by Sam Martin, ARM Graphics Architect at http://malideveloper.arm.com/downloads/ARM\_Game\_Developer\_Days/LondonDec15/presentations/Mali\_GPU\_Architecture.pdf (accessed on 10/27/16).

#### Registers

The ALU pipeline can read/write to 32 128-bit registers, which can be divided into 4 32-bit (highp in GLSL) components (one vec4) or 8 16-bit (mediump) components (two vec4's). Some of the registers, however, are dedicated to special purposes (see below) and are read-only or write-only.

#### **Special Registers**

```
r24 - can mean "unused" for 1-src instructions, or a pipeline register
r26 - inline constant
r27 - load/store offset when used as output register
r28-r29 - texture pipeline results
r31.w - conditional select input when written to in scalar add ALU
```

r0 - r23 is divided into two spaces: work registers and uniform registers. A configurable number of registers can be devoted to each; if there are N uniform registers, then r0 - r(23-N) are work registers and r(24-N)-r23 are uniform registers.

See http://limadriver.org/T6xx+ISA/.


"an output interface configured to send said resultant value to a frame buffer."

an output interface configured to send said resultant value to a frame buffer.

The LG Products include an output interface configured to send said resultant values to a frame buffer.

For example, the LG Products include an internal and an external frame buffer. Moreover, the Mali GPU includes an output interface for writing to the frame buffers.

#### Shader Core Architecture



*See* Ryan Smith, ARM's Mali Midgard Architecture Explored, http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/4.






- Hardware tiling
- Forward Pixel Kill
  - Reduce overdraw
- Framebuffer memory on-chip
  - 4x MSAA for "free"
  - Advanced on-chip shading
- Bandwidth efficiencies
  - ARM Framebuffer Compression
  - Transaction elimination
  - ASTC

See "Arm Mali GPU Architecture," a presentation by Sam Martin, ARM Graphics Architect at http://malideveloper.arm.com/downloads/ARM\_Game\_Developer\_Days/LondonDec15/presentations/Mali\_GPU\_Architecture.pdf (accessed on 10/27/16).



See http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/4.

"40. A device comprising:"

40. A device comprising:

The LG 43UH6500 television and X Power LS755 phone (collectively, the "LG Products") include a device.

4K UHD HDR Smart LED TV - 43" Class (42.5" Diag) 43UH6500

See http://www.lg.com/us/tvs/lg-43UH6500-4k-uhd-tv.



*See* LG LED TV Service Manual, Chassis: UA63J, Model: 43UH6500, p.28, *available at* https://lg.encompass.com/shop/model\_research\_docs/?file=/ZEN/sm/43UH6500UB.pdf.

# **Technical Specifications**

Carrier Boost Mobile®

Display 5.3" (1280 x 720) HD TFT Display

Battery 4,100 mAh non-removable

Platform Android 6.0.1 Marshmallow

Processor MediaTek 1.8 GHz Octa-Core MT6755M

See http://www.lg.com/us/cell-phones/lg-LS755-x-power-boost-mobile.

The SoCs include one of the following ARM Mali graphics processing units (the "Mali GPUs"): T760 MP2 and T860 MP2.

|          |           | M16                       |
|----------|-----------|---------------------------|
|          | CPU       | CA53 x4 1.1GHz /<br>1MB   |
|          | GPU       | Mali T760 MP2<br>(650MHz) |
| Smart    | OSD       | Separated 2K@60p          |
| Function | HEVC      | 4K @60,10bit              |
|          | DDR       | DDR3-2133/<br>DDR4-2400   |
|          | Audio DSP | HiFi3 Dual @370MHz        |

*See* LG LED TV Service Manual, Chassis: UA63J, Model: 43UH6500, p.123, *available at* https://lg.encompass.com/shop/model\_research\_docs/?file=/ZEN/sm/43UH6500UB.pdf.

## MediaTek MT6755 Helio P10 Specs

| Release  | Q4 2015                      |
|----------|------------------------------|
| Process  | 28nm                         |
| Apps CPU | 8x Cortex-A53, up to 2.0GHz  |
| GPU      | ARM Mali-T860 MP2 at 700 MHz |

See http://cnoemphone.com/blog/mediatek-mt6755-helio-p10-specs-benchmark-and-smartphone-list.

The Mali GPUs share substantially similar structure, function, and operation.



"a plurality of unified shaders synchronized by a clock mechanism to process shading operations together,"

a plurality of unified shaders synchronized by a clock mechanism to process shading operations together,

The LG Products include a plurality of unified shaders synchronized by a clock mechanism to process shading operations together.

For example, the Mali GPUs include a plurality of unified shaders.



See http://www.arm.com/products/multimedia/mali-gpu/high-performance/mali-t860-t880.php.

#### **GPU Architecture**

The "Midgard" family of Mali GPUs (the Mali-T600 and Mali-T700 series) use a unified shader core architecture, meaning that only a single type of shader core exists in the design. This single core can execute all types of programmable shader code, including vertex shaders, fragment shaders, and compute kernels.

*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.

Further, the plurality of unified shaders are synchronized by a clock mechanism to process shading operations together.

For example, the GPU can issue multiple instructions in parallel for each shader core per clock cycle.

#### **GPU Limits**

Based on this simple model it is possible to outline some of the fundamental properties underpinning the GPU performance.

- The GPU can issue one vertex per shader core per clock
- The GPU can issue one fragment per shader core per clock
- The GPU can retire one pixel per shader core per clock
- We can issue one instruction per pipe per clock, so for a typical shader core we can issue four instructions in parallel if we have them available to run
  - We can achieve 17 FP32 operations per A-pipe
  - One vector load, one vector store, or one vector varying per LS-pipe
  - One bilinear filtered texel per T-pipe
- The GPU will typically have 32-bits of DDR access (read and write) per core per clock [configurable]

*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.
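By way of a rough sketch, the per-clock limits quoted above can be expressed as simple arithmetic. The class `ShaderCoreModel` is hypothetical, the choice of two arithmetic pipes per core follows the T760 figure reported elsewhere in this exhibit, and reading "MP2" as two shader cores is an assumption of this sketch rather than a statement from the cited blog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShaderCoreModel:
    """Hypothetical model of one shader core's tripipe (illustration only)."""
    arithmetic_pipes: int        # A-pipes per core (configuration dependent)
    load_store_pipes: int = 1    # one LS-pipe per core, per the quoted text
    texture_pipes: int = 1       # one T-pipe per core, per the quoted text

    def instructions_per_clock(self) -> int:
        # One instruction can issue per pipe per clock.
        return self.arithmetic_pipes + self.load_store_pipes + self.texture_pipes

    def fp32_ops_per_clock(self, ops_per_a_pipe: int = 17) -> int:
        # 17 FP32 operations per A-pipe, per the quoted list.
        return self.arithmetic_pipes * ops_per_a_pipe


core = ShaderCoreModel(arithmetic_pipes=2)
cores = 2  # assumption: "MP2" denotes two shader cores

print(core.instructions_per_clock())       # 4
print(cores * core.fp32_ops_per_clock())   # 68
```

Under these assumptions the model reproduces the "four instructions in parallel" figure for a typical shader core.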

Further, for example, the workloads for the Mali GPU are queued and "processed by the GPU at the same time, so vertex processing and fragment processing for different render targets can be running in parallel." Further, the "workload for a single render target is broken into smaller pieces and distributed across all of the shader cores in the GPU."



The graphics work for the GPU is queued in a pair of queues, one for vertex/tiling workloads and one for fragment workloads, with all work for one render target being submitted as a single submission into each queue. Workloads from both queues can be processed by the GPU at the same time, so vertex processing and fragment processing for different render targets can be running in parallel (see the first blog for more details on this pipelining methodology). The workload for a single render target is broken into smaller pieces and distributed across all of the shader cores in the GPU, or in the case of tiling workloads (see the second blog in this series for an overview of tiling) a fixed function tiling unit.

*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.

"wherein each of the unified shaders comprises: an input interface for receiving a packet from a rasterizer;"

wherein each of the unified shaders comprises: an input interface for receiving a packet from a rasterizer;

Each of the unified shaders in the LG Products includes an input interface for receiving a packet from a rasterizer.

For example, "[t]he Mali shader core is structured as a number of fixed-function hardware blocks wrapped around a programmable tri-pipe execution core. The fixed function units perform the setup for a shader operation - such as rasterizing triangles or performing depth testing - or handling the post-shader activities - such as blending, or writing back a whole tile's worth of data at the end of rendering. The tripipe itself is the programmable part responsible for the execution of shader programs."

*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.

### Shader Core Architecture



*See* Ryan Smith, ARM's Mali Midgard Architecture Explored, http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/4.

6 Advanced Graphics Techniques 6.1 Custom shaders

#### Primitive assembly

In primitive assembly the vertices are assembled into geometric primitives. The resulting primitives are clipped to a clipping volume and sent to the rasterizer.

#### Rasterization

Output values from the vertex shader are calculated for every generated fragment. This process is known as interpolation. During rasterization, the primitives are converted into a set of two-dimensional fragments that are then sent to the fragment shader.

#### Transform feedback

Transform feedback, enables writing selective writing to an output buffer that the vertex shader outputs and is later sent back to the vertex shader. This feature is not exposed by Unity but it is used internally, for example, to optimize the skinning of characters.

### Fragment shader

The fragment shader implements a general-purpose programmable method for operating on fragments before they are sent to the next stage.

#### Per-fragment operations

In Per-fragment operations several functions and tests are applied on each fragment: pixel ownership test, scissor test, stencil and depth tests, blending and dithering. As a result of this per-fragment stage either the fragment is discarded or the fragment color, depth or stencil value is written to the frame buffer in screen coordinates.

See "ARM Guide to Unity" Version 2.1 page 6-77 available at http://infocenter.arm.com/help/topic/com.arm.doc.100140\_0201\_00\_en/arm\_guide\_to\_unity\_enhancing\_your\_mobile\_games\_100140\_0201\_00\_en.pdf (accessed 10/27/2016)

"a shading processing mechanism configured to produce a resultant value from said packet by performing one or more shading operations,"

a shading processing mechanism configured to produce a resultant value from said packet by performing one or more shading operations,

Each of the unified shaders in the LG Products includes a shading processing mechanism configured to produce a resultant value from a rasterizer packet by performing one or more shading operations.

For example, the Mali GPU has three classes of pipelines "in the tripipe design: one handling arithmetic operations, one handling memory load/store and varying access, and one handling texture access. There is one load/store and one texture pipe per shader core, but the number of arithmetic pipelines can vary." *See* http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/4.



*See* http://malideveloper.arm.com/downloads/ARM\_Game\_Developer\_Days/PDFs/2-Mali-GPU-architecture-overview-and-tile-local-storage.pdf.



*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.

"wherein said shading operations comprise both texture operations and color operations and comprising at least one ALU/memory pair operative to perform both texture operations and color operations"

wherein said shading operations comprise both texture operations and color operations and comprising at least one ALU/memory pair operative to perform both texture operations and color operations

The LG Products include shading operations comprising both texture operations and color operations and comprising at least one ALU/memory pair operative to perform both texture operations and color operations.

In particular, the shading operations comprise texture operations. For example, the SC includes a texture pipeline, load/store pipeline and a plurality of arithmetic pipelines. The "texture pipeline (T-pipe) is responsible for all memory access to do with textures. The texture pipeline can return one bilinear filtered texel per clock; trilinear filtering requires us to load samples from two different mipmaps in memory." *See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.



*See* http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/4.

#### 7.2.2 About Mali GPU architectures

Mali GPUs use a SIMD architecture. Instructions operate on multiple data elements simultaneously.

The peak throughput depends on the hardware implementation of the Mali GPU type and configuration.

The Mali GPUs contain 1 to 16 identical shader cores. Each shader core supports up to 384 concurrently executing threads.

Each shader core contains:

- One to four arithmetic pipelines.
- One load-store pipeline.
- One texture pipeline.

See "ARM Guide to Unity" Version 2.1 page 7-64 available at http://infocenter.arm.com/help/topic/com.arm.doc.100140\_0201\_00\_en/arm\_guide\_to\_unity\_enhancing\_your\_mobile\_games\_100140\_0201\_00\_en.pdf (accessed 10/27/2016).

The arithmetic pipeline ("ALU") performs texture operations. For example, the "ALU pipeline can read/write to 32 128-bit registers" including "texture pipeline results" from the texture pipe.

### Registers

The ALU pipeline can read/write to 32 128-bit registers, which can be divided into 4 32-bit (highp in GLSL) components (one vec4) or 8 16-bit (mediump) components (two vec4's). Some of the registers, however, are dedicated to special purposes (see below) and are read-only or write-only.

#### **Special Registers**

```
r24 - can mean "unused" for 1-src instructions, or a pipeline register
r26 - inline constant
r27 - load/store offset when used as output register
r28-r29 - texture pipeline results
r31.w - conditional select input when written to in scalar add ALU
```

r0 - r23 is divided into two spaces: work registers and uniform registers. A configurable number of registers can be devoted to each; if there are N uniform registers, then r0 - r(23-N) are work registers and r(24-N)-r23 are uniform registers.

See http://limadriver.org/T6xx+ISA/.
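As a minimal sketch of the register layout quoted above, the following shows how one 128-bit register holds either four 32-bit components or eight 16-bit components. The names `REGISTER_BITS` and `components` are hypothetical, and Python's `struct` module is used only to make the byte widths concrete; it is not how the hardware or any driver represents registers.

```python
import struct

REGISTER_BITS = 128  # width of one register, per the quoted passage


def components(bits_per_component: int) -> int:
    """Number of components of the given width that fit in one register."""
    return REGISTER_BITS // bits_per_component


# One vec4 of 32-bit floats (highp) occupies a full 128-bit register.
highp = struct.pack("<4f", 1.0, 0.5, 0.25, 0.0)

# Two vec4s of 16-bit floats (mediump) occupy the same 128 bits.
mediump = struct.pack("<8e", *([1.0] * 8))

print(components(32), components(16))        # 4 8
print(len(highp) * 8, len(mediump) * 8)      # 128 128
```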

The ALU also performs color operations. For example, the "Mali [GPU] only has to write the color data for a single tile back to memory at the end of the tile."

*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/02/20/the-mali-gpu-an-abstract-machine-part-2.



*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/02/20/the-mali-gpu-an-abstract-machine-part-2.



*See* https://community.arm.com/groups/arm-mali-graphics/blog/2012/08/17/how-low-can-you-go-building-low-power-low-bandwidth-arm-mali-gpus.

Moreover, the ALU is responsible for performing the "[m]ath in the shaders[.]"

## ARM® Mali™-T628 GPU Tripipe

## **Tripipe Cycles**

- Arithmetic instructions
  - Math in the shaders
- Load & Store instructions
  - Uniforms, attributes and varyings
- Texture instructions
  - Texture sampling and filtering
- Instructions can run in parallel
- Each one can be a bottleneck
- There are two arithmetic pipelines so we should aim to increase the arithmetic workload



See http://malideveloper.arm.com/downloads/GDC14/Weds/11.15amStreamlineMaliHWCounters.pdf.

Additionally, "there are three classes of execution pipeline in the tripipe design: one handling arithmetic operations, one handling memory load/store and varying access, and one handling texture access. There is one load/store and one texture pipe per shader core, but the number of arithmetic pipelines can vary depending on which GPU you are using; most silicon shipping today will have two arithmetic pipelines." *See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.

*See* http://malideveloper.arm.com/downloads/ARM\_Game\_Developer\_Days/PDFs/2-Mali-GPU-architecture-overview-and-tile-local-storage.pdf.

#### 1.1 About Mali GPUs

ARM produces the following families of Mali GPUs:

#### Mali Midgard GPUs

Mali Midgard GPUs include the following:

- Mali-T600 series.
- Mali-T720.
- Mali-T760.
- Mali-T820.
- Mali-T830.
- Mali-T860.
- Mali-T880.

#### Mali Utgard GPUs

The Mali Utgard GPUs include the following:

- Mali-300.
- Mali-400 MP.
- Mali-450 MP.
- Mali-470 MP.

Note: The Mali Utgard GPUs do not support OpenCL.

Mali GPUs can have one or more shader cores. Each shader core contains one or more *Arithmetic Logic Units* (ALUs).

The ALUs are based on a *Single Instruction Multiple Data* (SIMD) architecture. Instructions operate on multiple data elements simultaneously.

Mali GPUs run data processing tasks in parallel that contain relatively little control code. Mali GPUs typically contain many more processing units than application processors. This enables Mali GPUs to compute at a higher rate than application processors, without using more power.

See "ARM Guide to Unity" Version 2.1 page 1.1 available at http://infocenter.arm.com/help/topic/com.arm.doc.100140\_0201\_00\_en/arm\_guide\_to\_unity\_enhancing\_your\_mobile\_games\_100140\_0201\_00\_en.pdf (accessed 10/27/2016).



See ARM, How to Optimize Your Mobile Game with ARM Tools and Practical Examples, p.33, http://malideveloper.arm.com/downloads/GDC15/How%20to%20Optimize%20Your%20Mobile%20Game%20with%20ARM%20Tools%20and%20Practical%20Examples.pdf.

Furthermore, the unified shader comprises at least one ALU/memory pair. For example, as depicted below, each shader core is paired up with shared memory.



"wherein texture operations comprise issuing a texture request to a texture unit and writing received texture values to the memory and"

wherein texture operations comprise issuing a texture request to a texture unit and writing received texture values to the memory and

The texture operations comprise issuing a texture request to a texture unit and writing received texture values to the memory.

For example, as depicted below, the "thread pool" issues texture packets to the "texture pipeline." Moreover, the "texture pipeline (T-pipe) is responsible for all memory access to do with textures. The texture pipeline can return one bilinear filtered texel per clock; trilinear filtering requires us to load samples from two different mipmaps in memory, so requires a second clock cycle to complete."

*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.



*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.

#### **Texture Pipeline**

The texture pipeline (T-pipe) is responsible for all memory access to do with textures. The texture pipeline can return one bilinear filtered texel per clock; trilinear filtering requires us to load samples from two different mipmaps in memory, so requires a second clock cycle to complete.

See https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core
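The throughput statement in the quoted passage (one bilinear filtered texel per clock, with trilinear filtering requiring a second clock cycle) can be sketched as a simple cost estimate. The function `texture_pipe_clocks` is hypothetical, and the model is deliberately simplified: it assumes every fetch hits the cache and ignores any stalls.

```python
# Clocks needed by the T-pipe to return one filtered texel, per the quoted text.
CLOCKS_PER_TEXEL = {"bilinear": 1, "trilinear": 2}


def texture_pipe_clocks(texels: int, filtering: str) -> int:
    """Estimate T-pipe clocks for a number of texels (idealized, no cache misses)."""
    if texels < 0:
        raise ValueError("texels must be non-negative")
    return texels * CLOCKS_PER_TEXEL[filtering]


print(texture_pipe_clocks(1000, "bilinear"))   # 1000
print(texture_pipe_clocks(1000, "trilinear"))  # 2000
```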

#### The Texture pipeline

Texture accesses use cycles in the Texture pipeline and use memory bandwidth. Using large textures can be detrimental because cache misses are more likely and this can cause multiple threads to stall while waiting for data.

To improve the performance of the Texture pipeline try the following:

#### Use mipmaps

Mipmaps increase the cache hit rate because it selects the best resolution of the texture to use based on the variation of texture coordinates.

#### Use texture compression

This is also good for reducing the memory bandwidth and increasing the cache hit rate. Each compressed block contains more than one texel, so accessing it makes it more cacheable.

#### Avoid trilinear or anisotropic filtering

Trilinear and anisotropic filtering increase the number of operations required to fetch texels. Avoid using these techniques unless you absolutely require them.

See "ARM Guide to Unity" Version 2.1 page 1-6 available at http://infocenter.arm.com/help/topic/com.arm.doc.100140\_0201\_00\_en/arm\_guide\_to\_unity\_enhancing\_your\_mobile\_games\_100140\_0201\_00\_en.pdf (accessed 10/27/2016)

## Images and Compute

Another way to update textures

- Compute shaders mandate image load/store operations
  - These have been optional in other shader stages
- Allow random read/write access to a texture bound as an image sampler
  - Use image\*D as shader sampler type
- Layer parameters control whether a single image, or an entire level is made accessible
  - Think texture array or 3D textures

```
// Setup
glGenTextures( ... );
glBindTexture( GL_TEXTURE_2D, texId );
glTextureStorage2D( GL_TEXTURE_2D, levels,
    format, width, height );
glBindImageTexture( unit, texId, Layered,
    Layer, GL_READ_WRITE, GL_RGBA32F );

// Update
glUseProgram( compute );
glDispatchCompute( ... );
glMemoryBarrier( GL_SHADER_STORAGE_BARRIER_BIT );

// Use
glUseProgram( render );
glDrawArrays( ... );
```

See http://www.gdcvault.com/play/1020140/Getting-the-Most-Out-of.



*See* ARM, ARM Tools Part 2, Best Optimization Practices for Mobile Platforms, p.13, *available at* http://malideveloper.arm.com/downloads/ARM\_Game\_Developer\_Days/PDFs/6%20-%20ARM%20Tools%20Part%202-%20Best%20Optimization%20Practices%20for%20Mobile%20Platforms.pdf.

Furthermore, the received texture values are written into memory.

### Registers

The ALU pipeline can read/write to 32 128-bit registers, which can be divided into 4 32-bit (highp in GLSL) components (one vec4) or 8 16-bit (mediump) components (two vec4's). Some of the registers, however, are dedicated to special purposes (see below) and are read-only or write-only.

#### **Special Registers**

```
r24 - can mean "unused" for 1-src instructions, or a pipeline register
r26 - inline constant
r27 - load/store offset when used as output register
r28-r29 - texture pipeline results
r31.w - conditional select input when written to in scalar add ALU
```

r0 - r23 is divided into two spaces: work registers and uniform registers. A configurable number of registers can be devoted to each; if there are N uniform registers, then r0 - r(23-N) are work registers and r(24-N)-r23 are uniform registers.

See http://limadriver.org/T6xx+ISA/.

"wherein the at least one ALU is operative to read from and write to the memory to perform both texture and color operations; and"

wherein the at least one ALU is operative to read from and write to the memory to perform both texture and color operations; and

The LG Products include at least one ALU that is operative to read from and write to the memory to perform both texture and color operations.

For example, the ALU is designed to "strike a closer balance between shading and texturing."



As we've stated before, for our purposes we're primarily looking at the Mali-T760. On the T760 ARM uses 2 ALU blocks per tri pipe, which is the most common configuration that you will see for Midgard. However ARM also has Midgard designs that have 1 ALU block or 4 ALU blocks per tri pipe, which is one of the reasons why seemingly similarly GPUs such as T760, T720, and T678 can look so similar and yet behave so differently.

ARM Mali Midgard Arithmetic Pipeline Count (Per Core)

| GPU  | Arithmetic pipelines |
|------|----------------------|
| T628 | 2                    |
| T678 | 4                    |
| T720 | 1                    |
| T760 | 2                    |

Without being fully exhaustive, among various Midgard designs T628 and T760 are 2 ALU designs, while T720 is a 1 ALU design, and T678 is a 4 ALU design.

As one would expect, the different number of arithmetic pipelines per tri pipe has a knock-on effect on performance in all aspects, due to the changing ratio between the number of arithmetic pipelines and the number of load/store units and texture units. T678, for example, would be fairly shader-heavy, whereas the 2 ALU designs strike a closer balance between shading and texturing. Among the various Midgard designs ARM has experimented with several configurations, and with the T700 series they have settled on 2 ALU designs for the high-end T760 and 1 ALU for the mid-range T720 (although ARM likes to point out that T720 has some further optimizations just for this 1 ALU configuration).

*See* http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/4.
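As an illustrative sketch, the "balance between shading and texturing" discussed in the quoted article can be expressed as the ratio of arithmetic pipelines to texture pipelines per core. The pipeline counts are taken from the quoted table; the names `ARITHMETIC_PIPES` and `arithmetic_to_texture_ratio` are hypothetical, and the T860 is omitted because the quoted table does not list it.

```python
# Arithmetic pipelines per core, as listed in the quoted table.
ARITHMETIC_PIPES = {"T628": 2, "T678": 4, "T720": 1, "T760": 2}

# The quoted sources describe one texture pipe per shader core.
TEXTURE_PIPES_PER_CORE = 1


def arithmetic_to_texture_ratio(gpu: str) -> float:
    """Ratio of arithmetic pipes to texture pipes for one shader core."""
    return ARITHMETIC_PIPES[gpu] / TEXTURE_PIPES_PER_CORE


for gpu in sorted(ARITHMETIC_PIPES):
    print(gpu, arithmetic_to_texture_ratio(gpu))
```

On this measure the T678 is the most shader-heavy design in the table, while the two-ALU T628 and T760 sit between it and the one-ALU T720.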

Further, the ALUs are operative to read from and write to memory to perform texture and color operations as shown below.

[Figure: Mali shader core block diagram. Labeled blocks: Tilelist Reader, Rasterizer, Early ZS Testing, Fragment Thread Creator, Vertex Thread Creator, Thread Pool, Load/Store Pipeline, L1 Cache, Texture Pipeline, L1 Cache, Thread Retire, Tripipe, Late ZS Testing, Blending, Tile Memory, Tile Writeback.]

*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.



*See* https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core.

## Mali Architecture



- Hardware tiling
- Forward Pixel Kill
  - Reduce overdraw
- Framebuffer memory on-chip
  - 4x MSAA for "free"
  - Advanced on-chip shading
- Bandwidth efficiencies
  - ARM Framebuffer Compression
  - Transaction elimination
  - ASTC

See "Arm Mali GPU Architecture," a presentation by Sam Martin, ARM Graphics Architect at http://malideveloper.arm.com/downloads/ARM\_Game\_Developer\_Days/LondonDec15/presentations/Mali\_GPU\_Architecture.pdf (accessed on 10/27/16).

### Registers

The ALU pipeline can read/write to 32 128-bit registers, which can be divided into 4 32-bit (highp in GLSL) components (one vec4) or 8 16-bit (mediump) components (two vec4's). Some of the registers, however, are dedicated to special purposes (see below) and are read-only or write-only.

#### **Special Registers**

```
r24 - can mean "unused" for 1-src instructions, or a pipeline register
r26 - inline constant
r27 - load/store offset when used as output register
r28-r29 - texture pipeline results
r31.w - conditional select input when written to in scalar add ALU
```

r0 - r23 is divided into two spaces: work registers and uniform registers. A configurable number of registers can be devoted to each; if there are N uniform registers, then r0 - r(23-N) are work registers and r(24-N)-r23 are uniform registers.

See http://limadriver.org/T6xx+ISA/.

"an output interface configured to send said resultant value to a frame buffer."

an output interface configured to send said resultant value to a frame buffer.

Each of the unified shaders in the LG Products includes an output interface configured to send said resultant values to a frame buffer.

For example, the LG Products include an internal and an external frame buffer. Moreover, the Mali GPU includes an output interface for writing to the frame buffers.

### Shader Core Architecture



See Ryan Smith, ARM's Mali Midgard Architecture Explored, http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/4.





- Hardware tiling
- Forward Pixel Kill
  - Reduce overdraw
- Framebuffer memory on-chip
  - 4x MSAA for "free"
  - Advanced on-chip shading
- Bandwidth efficiencies
  - ARM Framebuffer Compression
  - Transaction elimination
  - ASTC

See "Arm Mali GPU Architecture," a presentation by Sam Martin, ARM Graphics Architect at http://malideveloper.arm.com/downloads/ARM\_Game\_Developer\_Days/LondonDec15/presentations/Mali\_GPU\_Architecture.pdf (accessed on 10/27/16).



See http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/4.